1 Introduction

1.1 Background and motivation

Human resources are the most valuable asset of any country¹. They are the main reason behind the success or failure of any organization. In fact, an educated and competent workforce is the key driver of economic and social development. In this context, the importance of academic education has become undeniable. It is therefore crucial to invest time and money in studying students’ academic performance and figuring out effective ways to improve it.

Given its importance, the topic has received particular attention in past research: many studies have analyzed the factors impacting students’ academic performance. While some focused on psychological variables, such as the research of Franck Amadieu and André Tricot², others examined the impact of elements such as mobility³, gender and other socio-economic factors on students’ academic success.

Many reasons motivated us to choose this topic of research. In fact, as students, we are very passionate about the educational field. Thus, we want to provide through this project a detailed analysis that can be used as a reference guide for leaders working in the educational field. Mainly, we want to help schools and universities to have a better understanding of the factors influencing students’ academic performance in order to improve their decision-making processes, students’ success rate and eventually their overall organization.

Sources: ¹ Jean-Marie Peretti, Gestion des ressources humaines, 2004. ² Psychological factors which have an effect on student success, 2015. ³ La migration pour études : regards d’intervenants sur l’accueil et l’intégration des nouveaux étudiants, 2009.

1.2 Project objectives

The aim of the project is to understand the evolution of secondary academic performance in France. Our study will mainly focus on “troisième” (3ème) students, the final year of collège (equivalent to 11th grade in Switzerland), and their results on the Diplôme National du Brevet (DNB) by school.

First, we will observe whether DNB admissions have improved or, on the contrary, deteriorated over the years. From this data set, we will also make comparisons, particularly at the geographical level, and analyse the success rate in terms of distinctions for each school.

Then, we will try to understand whether there is a correlation between academic success and socio-economic factors such as the type of accommodation, the rate of single-parent families, and schools’ involvement in students’ physical and sports practice. Finally, beyond these factors, we will investigate whether the COVID-19 pandemic had a direct negative impact on students’ school performance.

1.3 Research questions

  • What is the evolution of student performance over time and across the different regions/departments of France?

  • Do socio-economic factors such as the type of accommodation, family situation or college policies have an influence on student success?

  • Has the COVID-19 pandemic impacted student performance?

2 Data Sets

2.1 dnb_results

This dataset presents the results of the “diplôme national du brevet” by school, for schools in metropolitan France and for the overseas departments and regions. This data set contains 139’580 observations.

Variable Meaning
session Year of the exam session
school_id School identification number
school_type School type divided in six categories: COLLEGE, LYCEE PROFESSIONNEL, LYCEE, EREA, CFA, and AUTRE
establishment_name Name of the establishment
education_sector Education sector categorised as public or private
municipality_code Municipality code
municipality Name of the municipality
department_code Department code. Note that France has 101 departments.
department Name of the department
academy_code Academy code
academy_name Name of the academy
region_code Region code. Note that France has 18 administrative regions.
region Name of the region
registered Registered candidates
present Candidates present for the exam
admitted Candidates admitted
admitted_without Candidates admitted without distinction
admitted_AB Candidates admitted with distinction “Assez Bien”
admitted_B Candidates admitted with distinction “Bien”
admitted_TB Candidates admitted with distinction “Très bien”
success_rate Success rate, [admitted]/[present], as a percentage

Source of the data set

2.2 Generation 2024 data set

This data set gathers all schools which have been awarded the “Generation 2024” label. The objective of this label, developed in view of the Paris 2024 Olympic Games, is to develop bridges between the school world and the sports movement in order to encourage young people to take part in physical activity and sport. This data set contains 6’883 observations.

Variable Meaning
region Name of the region
academy Name of the academy
department Name of the department
municipality Name of the municipality
establishment Name of the establishment
school_id School identification number
school_type School type
education_sector Education sector categorised as public or private
postcode Postcode
adress Address of the establishment
adress_2 Additional address of the establishment
mail E-mail address of the establishment
students Number of students in the school
priority_education Indicates whether the school is located in a priority education network (REP) or a reinforced priority education network (REP+)
city_school Indicates whether the school is part of a city school
QPV Position relative to a priority neighbourhood of the city policy. It is a policy aimed at compensating for differences in living standards with the rest of the territory.
ULIS Indicates whether the school offers a ULIS (Localized Unit for School Inclusion)
SEGPA Indicates whether the school has a SEGPA (adapted general and vocational education section)
sport_section Indicates whether the school has a sports section
agricultural_high_school Indicates whether the school is an agricultural high school
military_high_school Indicates whether the school is a military high school
vocational_high_school Indicates whether the establishment is labeled “vocational high school”
establishment_web Url of the description of the establishment page on the ONISEP website
SIREN_SIRET SIREN/SIRET number of the establishment. SIREN stands for Business Register Identification System in French.
district Name of the district to which the school is attached
ministry Ministry responsible for the institution
label_start_date Start date of the “generation 2024” label. Format yyyy/mm/dd
label_end_date End date of the “generation 2024” label. Format yyyy/mm/dd
y_coordinate Y coordinate of the establishment, using the EPSG coordinate system
x_coordinate X coordinate of the establishment, using the EPSG coordinate system
epsg EPSG code of the coordinate system used to locate the establishment
precision_on_localisation Specification of the geographical location of the establishment
latitude Latitude
longitude Longitude
position Geographical position
engaging_30_sport Indicates whether the institution participates in the 30 minutes of daily physical activity programme

Source of the data set

2.3 Student housing Data set

This dataset records enrolment in secondary schools according to the type of accommodation for pupils: half-board, boarding school etc. This data set contains 32’096 observations.

Variable Meaning
year_back_to_school Year of the start of the school year
region Name of the academic region
academy Name of the academy
department Name of the department
municipality Name of the municipality
number School identification number
establishment_main_name Main name of the establishment
establishment_name Name of the establishment
school_type School type
education_sector Education sector categorised as public or private
students_secondary_education Students in secondary education
students_higher_education Number of students in higher education
external_students_secondary_education External students in secondary education
half_boarders_students_secondary_education Half-boarders in secondary education
boarding_students_secondary_education Boarding students in secondary education
external_students_higher_education External students in higher education
half_board_students_higher_education Half-board students in higher education
boarding_students_higher_education Boarding students in higher education

Source of the data set

2.4 Single-parent families dataset

This data set provides information about single-parent families in each department. The census has been carried out in two consecutive years every five years since 2007. This data set contains 606 observations.

Variable Meaning
geocode Departmental code from INSEE
libgeo Name of department
year Census year
sing_par Number of single-parent families

Source of the data set

2.5 Covid Data set

This is a time-based data set giving information on the COVID-19 tests and results carried out by laboratories, hospitals, pharmacists, doctors and nurses. It is updated daily; as of 30 November 2022, it contained 82’394 observations.

Variable Meaning
department_code Department code
test_week Week of the tests. Format yyyy-Sww
educational_level Description of the age group as [m-n), m and n being the lower and upper limits.
age_group Denomination of the age group; the upper bound is shown as n-1, except for the oldest group, where 18 is used
pop Population
positive Weekly patients testing positive
tested Weekly patients tested
incidence_rate Incidence rate
positivity_rate Positivity rate
screening_rate Screening rate

Source of the data set

Loading of the data All but the single_parent data set are CSV files with semicolons as separators. The single_parent data set is in Excel format, so we use read_excel with skip = 4, because the document includes extra header rows.

DNB_par_etablissement <- read_delim(here::here("data/DNB-par-etablissement.csv"), ";", escape_double = FALSE, trim_ws = TRUE)
Etablissements_labellises_generation_2024 <- read_delim(here::here("data/Etablissements-labellises-generation-2024.csv"),";", escape_double = FALSE, trim_ws = TRUE)
Hebergement_eleves_etablissements_2d <- read_delim(here::here("data/Hebergement-eleves-etablissements-2d.csv"), ";", escape_double = FALSE, trim_ws = TRUE)
insee_rp_hist_xxxx <- read_excel(here::here("data/insee_rp_hist_xxxx.xlsx"), skip = 4)
covid_sp_dep_heb_cage_scol_2022_11_30_19h01 <- read_delim(here::here("data/covid_sp_dep_heb_cage_scol_2022_11_30_19h01.csv"), ";", escape_double = FALSE, trim_ws = TRUE)

3 Data Wrangling

We have realised that some wrangling is necessary for each data set. We have established a checklist that we will go through for each data set. We have to:

  1. Translate the column names. As we have to rename all data sets in the same way, we created a function. It takes a data frame and a vector as inputs and checks whether the length of the vector matches the number of columns; if so, it returns a tibble with the columns renamed, otherwise it throws an error stating that the vector is not the right length.

rename_df <- function(df, x){
  if (ncol(df) == length(x)){
    names(df) <- x
    as_tibble(df)
  } else {
    stop("Vector is not the right length")
  }
}
  2. Make sure that all data are of the right type.
  3. Make sure that the time references (year) are aligned with the exam session.
  4. Add a column department_fr, harmonized across all data sets so that they can be joined easily. We have also decided to keep only establishments in mainland France and Corsica in order to get more comparable results. We note that including the overseas regions would be interesting for further research.
  5. Check whether the data set needs any further specific wrangling.

3.1 dnb_results

  1. Translate the column names.

dnb_colnames <- c("session", "school_id", "school_type", "establishment_name", "education_sector", "municipality_code", "municipality", "department_code", "department", "academy_code", "academy_name", "region_code", "region", "registered", "present", "admitted", "admitted_without", "admitted_AB", "admitted_B", "admitted_TB", "success_rate_pct"
)
dnb_results <- rename_df(DNB_par_etablissement, dnb_colnames)
  2. success_rate is stored as a string of the form xx,xx%; we want it as a double of the form xx.xx.

dnb_results[["success_rate_pct"]] <- as.double(gsub("%","",
                                               gsub(",",".", dnb_results[["success_rate_pct"]])))
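As a quick check, the same two-step conversion applied to a sample value (the value "87,50%" is assumed, matching the xx,xx% pattern described above):

```r
# Replace the decimal comma, strip the percent sign, then parse as a double:
as.double(gsub("%", "", gsub(",", ".", "87,50%")))
#> 87.5
```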
  3. We need to harmonize the year variables of all the other data sets to match the logic of this one: the year is the year of the exam session (e.g. the academic year 2020-2021 is represented as 2021).
  4. We need to add the column department_fr and drop the overseas collectivities (COM).

dnb_results$department_fr <- stri_trans_general(dnb_results$department, "Latin-ASCII") %>%
  str_to_title(.) %>% 
  gsub("Du", "du", .) %>% 
  gsub("De", "de", .) %>% 
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>% 
  str_replace_all("deux-Sevres", "Deux-Sevres") %>% 
  str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%       
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>% 
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")

dnb_results <- dnb_results %>% 
  dplyr::filter(!department_fr %in% c("Polynesie-Francaise", "Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "NA", "-"))
  5. We want to know the attribution rate of each distinction. It is needed in order to make fair comparisons between regions that do not have the same number of students.
dnb_results <- dnb_results %>% 
  mutate(without_pct = admitted_without/admitted*100,
         AB_pct = admitted_AB/admitted*100,
         B_pct = admitted_B/admitted*100,
         TB_pct = admitted_TB/admitted*100
         )
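Each *_pct column is simply the distinction’s share of admitted candidates; with toy numbers (assumed, not from the data):

```r
# Toy counts illustrating the distinction-share formula above:
admitted    <- 200
admitted_TB <- 50
TB_pct <- admitted_TB / admitted * 100
TB_pct
#> 25
```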
  6. Other data sets used for the project only go down to the department level. We therefore have to summarize all values at the department level for further merges and analyses between data sets.
dnb_results_dep <- dnb_results %>% 
  select(session, department_code, region:TB_pct) %>% 
  group_by(department_fr, session, region) %>% 
  summarise(registered = sum(registered),
            present = sum(present),
            admitted = sum(admitted),
            admitted_without = sum(admitted_without),
            admitted_AB = sum(admitted_AB),
            admitted_B = sum(admitted_B),
            admitted_TB = sum(admitted_TB),
            without_pct = mean(without_pct, na.rm = TRUE),
            AB_pct = mean(AB_pct, na.rm = TRUE),
            B_pct = mean(B_pct, na.rm = TRUE),
            TB_pct = mean(TB_pct, na.rm = TRUE),
            success_rate_pct = mean(success_rate_pct, na.rm = TRUE))

We can see the summarized data set dnb_results_dep below.

3.2 establishment_24

  1. Translate the column names.
est_24_names <- c("region", "academy", "department", "municipality", "establishment", "school_id", "school_type", "education_sector", "postcode", "adress", "adress_2", "mail", "students", "priority_education", "city_school", "QPV", "ULIS", "SEGPA", "sport_section", "agricultural_high_school", "military_high_school", "vocational_high_school", "establishment_web", "SIREN_SIRET", "district", "ministry", "label_start_date", "label_end_date", "y_coordinate", "x_coordinate", "epsg", "precision_on_localisation", "latitude", "longitude", "position", "engaging_30_sport")
establishment_24 <- rename_df(Etablissements_labellises_generation_2024, est_24_names)
  2. All data are already of the right type; no changes needed.
  3. We need to add two variables, session_started and session_ended. Since the label has a start and an end date, we have to trace the first and the last session in which the establishment holds the “Generation 2024” label. Most labellisations start and end in January, but a few start and end in the middle of the year. Exams take place at the end of June or the beginning of July; therefore, we consider a labellisation granted in August or later as counting towards the next academic year.

establishment_24 <- establishment_24 %>% 
  mutate(session_started = case_when(month(label_start_date) <= 7 ~ year(label_start_date),
                                     month(label_start_date) >  7 ~ year(label_start_date)+1),
         session_ended = case_when(month(label_end_date) <= 7 ~ year(label_end_date),
                                   month(label_end_date) >  7 ~ year(label_end_date)+1)
         )
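The August cutoff can be restated as a small self-contained base-R sketch (session_of is a hypothetical helper, not part of the pipeline):

```r
# A label granted up to July counts towards that year's session;
# from August onwards it counts towards the next session.
session_of <- function(d) {
  d <- as.Date(d)
  m <- as.integer(format(d, "%m"))
  y <- as.integer(format(d, "%Y"))
  ifelse(m <= 7, y, y + 1)
}

session_of("2019-01-09")   # granted in January   -> session 2019
session_of("2019-09-15")   # granted in September -> session 2020
```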
  4. We need to add the column department_fr.

establishment_24$department_fr <- stri_trans_general(establishment_24$department, "Latin-ASCII") %>%
  str_to_title(.) %>% 
  gsub("Du", "du", .) %>% 
  gsub("De", "de", .) %>% 
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>% 
  str_replace_all("deux-Sevres", "Deux-Sevres") %>% 
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>% 
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")

The map below shows that the data set contains establishments from the overseas collectivities (COM) as well as from the French international schools.

As previously discussed, we decided to keep only data from mainland France, making sure to also remove the French international schools. We took the opportunity to drop unused variables.

establishment_24 <- establishment_24 %>% 
  dplyr::filter(!department_fr %in% c("Polynesie-Francaise", "Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>% 
  dplyr::filter(!department_fr == "NA") # "NA" and "-" make sure that we have no more international schools

# establishment_24 has many variables that we will certainly not use
establishment_24 <- establishment_24 %>% 
  select(-c(postcode:mail, city_school, QPV:SEGPA, establishment_web:ministry, precision_on_localisation))
  5. The “Diplôme National du Brevet” is the diploma received at the end of collège, so we keep only establishments that are collèges and drop all the rest.
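A minimal sketch of that filter on toy data, assuming collèges appear under the school_type label "COLLEGE" as in the dnb_results categories (the toy data frame is invented for illustration):

```r
# Keep only establishments whose school_type is "COLLEGE":
toy <- data.frame(establishment = c("A", "B", "C"),
                  school_type   = c("COLLEGE", "LYCEE", "COLLEGE"))
colleges <- subset(toy, school_type == "COLLEGE")
nrow(colleges)
#> 2
```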

  6. Other data sets only go down to the department level. We therefore have to create a simplified version for further merges between data sets. We summarize by counting the number of establishments labelled each session, by department. The end of the label period is not of interest to us, as the first end date is on 9.01.2021, which is categorised as session 2022, and we only have DNB results up until session 2021.

establishment_24_dep <- establishment_24 %>% 
  select(region, department_fr, session_started) %>% 
  group_by(department_fr, session_started) %>% 
  summarise(establishment = n())

We can see the summarized data set establishment_24_dep below.

3.3 student_housing

  1. Translate the column names.

housing_names <- c("year_back_to_school", "region", "academy", "department", "municipality", "school_id", "establishment_main_name", "establishment_name", "school_type", "education_sector", "students_secondary_education", "students_higher_education", "external_students_secondary_education", "half_boarders_students_secondary_education", "boarding_students_secondary_education", "external_students_higher_education", "half_board_students_higher_education", "boarding_students_higher_education")
student_housing <- rename_df(Hebergement_eleves_etablissements_2d, housing_names)
  2. All data are of the right type; no changes needed.
  3. We need to create a session variable, as year_back_to_school refers to the beginning of the school year and not the exam session.

student_housing <- student_housing %>% 
  mutate(session = year_back_to_school + 1) %>% 
    select(year_back_to_school,session, everything()) #here just to order variables
  4. We need to add the column department_fr and remove the departments outside mainland France.
student_housing$department_fr <- stri_trans_general(student_housing$department, "Latin-ASCII") %>%
  str_to_title(.) %>% 
  gsub("Du", "du", .) %>% 
  gsub("De", "de", .) %>% 
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>% 
  str_replace_all("deux-Sevres", "Deux-Sevres") %>% 
  str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%       
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>% 
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")

student_housing <- student_housing %>% 
  dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>% 
  dplyr::filter(!department_fr == "NA")
  5. We are not interested in students from higher education and can therefore drop their variables.
student_housing <- student_housing %>% 
  select(-c(contains("higher")))
  6. We want to know the rate of each housing offering. It is needed in order to make fair comparisons between regions that do not have the same number of students.
student_housing <- student_housing %>% 
  mutate(external_students_rate = external_students_secondary_education/students_secondary_education*100, 
         half_boarders_students_rate = half_boarders_students_secondary_education/students_secondary_education*100,
         boarding_students_rate = boarding_students_secondary_education/students_secondary_education*100)
  7. Other data sets only go down to the department level. We therefore create a simplified version, summarizing all values at the department level, for further merges between data sets.
housing_dep <- student_housing %>% 
  select(session, region, students_secondary_education:boarding_students_rate) %>% 
  group_by(session, department_fr) %>% 
  summarise(external_students_secondary_education = sum(external_students_secondary_education, na.rm = TRUE),
            half_boarders_students_secondary_education = sum(half_boarders_students_secondary_education, na.rm = TRUE), 
            boarding_students_secondary_education = sum(boarding_students_secondary_education, na.rm = TRUE),
            students_secondary_education = sum(students_secondary_education, na.rm = TRUE),
            external_students_rate = mean(external_students_rate, na.rm = TRUE),
            half_boarders_students_rate = mean(half_boarders_students_rate, na.rm = TRUE), 
            boarding_students_rate = mean(boarding_students_rate, na.rm = TRUE))

We can see the summarized data set housing_dep below.

3.4 single_parent

  1. Translate the column names.
sg_parent_names <- c("geocode", "department", "session","sing_par")
single_parent <- rename_df(insee_rp_hist_xxxx, sg_parent_names)
  2. We have to convert the “session” and “sing_par” variables from strings to doubles.
single_parent[["session"]]<- as.double(single_parent[["session"]])
single_parent[["sing_par"]]<- as.double(single_parent[["sing_par"]])
  3. The session variable is already present in this data set.
  4. We need to add the column department_fr and remove the departments outside mainland France.
single_parent$department_fr <- stri_trans_general(single_parent$department, "Latin-ASCII") %>%
  str_to_title(.) %>%
  gsub("Du", "du", .) %>%
  gsub("De", "de", .) %>%
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
  str_replace_all("deux-Sevres", "Deux-Sevres") %>%
  str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%      
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")

single_parent <- single_parent %>% 
dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>% 
  dplyr::filter(!department_fr == "NA")
  5. No further data wrangling is needed for this data set.

We can see the final table single_parent below.

3.5 covid_in_schools

  1. Translate the column names.
covid_names <- c("department_code", "test_week", "educational_level", "age_group", "pop", "positive", "tested", "incidence_rate", "positivity_rate", "screening_rate")
covid_in_schools <- rename_df(covid_sp_dep_heb_cage_scol_2022_11_30_19h01, covid_names)
  2. test_week will be treated in step 3; positive, incidence_rate and positivity_rate need to be converted to doubles.
covid_in_schools[["positive"]] <- as.double(gsub(",",".", covid_in_schools[["positive"]]))
covid_in_schools[["incidence_rate"]] <- as.double(gsub(",",".", covid_in_schools[["incidence_rate"]]))
covid_in_schools[["positivity_rate"]] <- as.double(gsub(",",".", covid_in_schools[["positivity_rate"]]))
  3. We need to create two new variables. The first is test_date, anchoring each week to a date; we chose the first day of the week. Since we only have the week number, we take, for each year, the Monday of the first week of that year and add the corresponding number of weeks. The second variable is the session: a session runs from August to July of the next year. As the session is assigned based on the month, a few tests done in the first days of August may count towards the “wrong” session. However, COVID cases in August are relatively low compared to the rest of the year, and this represents at most 6 days of tests, so we consider this margin of error acceptable.
covid_in_schools <- covid_in_schools %>%
  mutate(test_date = case_when (as.numeric(substr(test_week, 1,4))== 2020
                                ~ lubridate::ymd('2019-12-30') + lubridate::weeks(as.numeric(substr(test_week, 7,8))),
                                as.numeric(substr(test_week, 1,4))== 2021
                                ~ lubridate::ymd('2021-01-04') + lubridate::weeks(as.numeric(substr(test_week, 7,8))),
                                as.numeric(substr(test_week, 1,4))== 2022
                                ~ lubridate::ymd('2022-01-03') + lubridate::weeks(as.numeric(substr(test_week, 7,8)))),
         session = case_when(month(test_date) <= 7 ~ year(test_date),
                             month(test_date) >  7 ~ year(test_date)+1))
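The same week-to-date arithmetic in a self-contained base-R sketch (week_to_date is a hypothetical helper mirroring the case_when above, with the same per-year base Mondays):

```r
# "yyyy-Sww" -> a date anchoring that week, via a per-year base Monday
week_to_date <- function(w) {
  base <- c("2020" = as.Date("2019-12-30"),
            "2021" = as.Date("2021-01-04"),
            "2022" = as.Date("2022-01-03"))
  n <- as.numeric(substr(w, 7, 8))     # week number after the "S"
  unname(base[substr(w, 1, 4)]) + 7 * n
}

week_to_date("2021-S10")   # a Monday, 70 days after 2021-01-04
```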
                              
  4. This data set only has a department code, so we need to add the columns department_fr and region, taking them from dnb_results. We join the two data sets by department_code; to match the keys, we first remove the first character of the department_code in dnb_results.
reg_department <- dnb_results %>%
  select(c("department_code", "department_fr", "region")) %>%
  unique()

reg_department$department_code <- substring(reg_department$department_code, 2)
  
covid_in_schools <- right_join(x = covid_in_schools, y = reg_department, by = "department_code")
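For illustration (the sample codes are assumed): the dnb_results codes carry one extra leading character, so dropping it yields the shorter keys used in the covid data:

```r
# Drop the first character of each department code:
substring(c("075", "013", "02A"), 2)
#> "75" "13" "2A"
```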
  5. We are only interested in the 11 to 15 age group; we filter to keep only the students passing the DNB.
covid_in_schools <- covid_in_schools %>%
  filter(educational_level == "[11-15)")

We can see the final table covid_in_schools below.

3.6 Auxiliary data sets

We will use the ggplot2 France map for our visualizations.

map <- map_data("france")

In this map data, the region variable actually contains the departments, so we rename it “department_fr” to fit with the other data sets.

colnames(map)[5]<- "department_fr"

4 Exploratory data analysis

4.1 dnb_results

To explore this data set, we start at the national level to analyse the global tendency. We then move down to a regional analysis to compare the number of students and see which regions perform better, digging deeper into the success rate and the graduation rate for each mention. To complete the analysis, we look at the results of the best and worst performing establishments in 2020 and compare them with their 2006 results.

4.1.1 National analysis

We created a data set summarizing the results at the national level and simplified the visualisation process by pivoting the table. In the process, we created four new variables: Candidates, Number_of_students, Mention_type and Rate.

France_results <- dnb_results %>% 
  group_by(session) %>% 
  summarise(registered = sum(registered),
            present = sum(present),
            admitted = sum(admitted),
            admitted_without = sum(admitted_without),
            admitted_AB = sum(admitted_AB),
            admitted_B = sum(admitted_B),
            admitted_TB = sum(admitted_TB),
            without_pct = mean(without_pct, na.rm = TRUE),
            AB_pct = mean(AB_pct, na.rm = TRUE),
            B_pct = mean(B_pct, na.rm = TRUE),
            TB_pct = mean(TB_pct, na.rm = TRUE),
            success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>% 
  pivot_longer(c(registered, present,contains("admitted")),
               names_to = "Candidates",
               values_to = "Number_of_students") %>% 
   pivot_longer(c(contains("pct")), 
               names_to = "Mention_type",
               values_to = "Rate")

For this first graph, we plotted the Number_of_students and grouped them by Candidates.


p <- France_results %>% 
  ggplot(aes(x = session, y = Number_of_students, group = Candidates, color = Candidates))+
  geom_line()+
  scale_color_viridis(discrete = TRUE) +
    ggtitle("National DNB statistics") +
    theme_ipsum() +
    ylab("Number of students")

ggplotly(p, tooltip = c("x" ,"y"))

According to the graph, the number of admissions has been increasing over the years: 704’742 students were admitted in 2021, compared to only 572’236 in 2006. We notice, however, a slight decrease in 2018 and 2019; the positive trend then resumed, with a peak of 716’237 admissions in 2020. The registration numbers follow the same tendency: 729’803 students registered in 2006 compared to 801’721 in 2021, an increase of 9.85% and the highest figure recorded so far. Attendance also reached a record in 2021, with 795’209 students present.

For the analysis of the success rate and the rate of achievement of honours, we plotted the Rate grouped by Mention_type.


p <- France_results %>% 
  ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
  geom_line()+
  scale_color_viridis(discrete = TRUE) +
    ggtitle("National DNB statistics") +
    theme_ipsum() +
    ylab("Rate in %")

ggplotly(p, tooltip = c("x" ,"y"))

The number of students admitted without a mention has been decreasing since 2006, from 284’774 students (53.85%) in 2006 to 154’095 (25.11%) in 2021. The negative tendency slowed during the 2017-2021 period. As is the case for all mentions, there is a complete shift in 2017; this will be analysed in further detail in the first part of our analysis. Contrary to the students graduating without honours, the number of students admitted with the TB distinction shows a positive pattern, growing from 31’563 (4.49%) in 2006 to 204’523 (23.98%) in 2021. The patterns for the two other distinctions are relatively similar: they remained stable up until 2017, but the change was not as drastic as for the first two distinctions mentioned. The number of students admitted with the B distinction shows a positive pattern, developing from 91’719 students (13.66%) in 2006 to 180’774 (25.07%) in 2021. The number of students admitted with the AB distinction increased slightly, from 164’180 students (28%) in 2006 to 165’350 (25.83%) in 2021, a growth of 0.71%.

4.1.2 Regional Analysis

We started with a comparative analysis of the success rate between the regions. For this, we created the charts below based on the dnb_results data set.

We created this bar plot to visualize the average success rate of each French region. We used the group_by function to group the success rate data by region, computed the average success rate for each region, and plotted the results in the graph below.


p <- dnb_results %>% 
  select(success_rate_pct, region, department) %>% 
  group_by(region) %>% 
  summarise(success_rate = mean(success_rate_pct, na.rm = TRUE)) %>% 
  ggplot(aes(x = region, 
             y = success_rate, 
             fill = region)) + 
         geom_col() +
         scale_fill_viridis(discrete = TRUE) +
         labs(title = "Average success rate by region, 2006-2021", y = "Rate in %")+
         theme_ipsum() +
         theme(axis.title.x=element_blank(),
               axis.text.x=element_blank(),
               axis.ticks.x=element_blank())

ggplotly(p, tooltip = c("x" ,"y"))

The success rate follows a positive pattern: it has been increasing since 2006 in all French regions. On average, the regions have performed similarly over the years. However, Bretagne took the lead with a success rate of 89.8%, followed by Nouvelle-Aquitaine with 86.6% and, in third position, Auvergne-Rhône-Alpes with 86.4%.

Number of students per result and region during the period 2006-2021

To have a clearer idea of the regions that performed best, we created the following bar plots. They are based on summarized data and help us visualize the regions that recorded the highest number of admissions so far. Students’ distinctions have once again been taken into consideration, giving an idea of the regions that recorded the highest number of admissions for each distinction up to 2021.

Number of admitted by region

p <- dnb_results %>% 
 select(admitted, region) %>% 
 group_by(region) %>% 
 summarise(admitted = sum(admitted)) %>% 
 ggplot(aes(x = region,
            y = admitted, 
            fill = region)) + 
        geom_col() +
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Number of admitted by region") +
        ylab("Number of students")+
        theme(axis.title.x=element_blank(),
              axis.text.x=element_blank(),
              axis.ticks.x=element_blank())

ggplotly(p, tooltip = c("x" ,"y"))
Admitted with zero mention

p <- dnb_results %>% 
 select(admitted_without, region) %>% 
 group_by(region) %>% 
 summarise(zero = sum(admitted_without)) %>% 
 ggplot(aes(x = region, 
            y = zero, 
            fill = region)) + 
        geom_col() +
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with zero mention") +
        ylab("Number of students")+
        theme(axis.title.x=element_blank(),
              axis.text.x=element_blank(),
              axis.ticks.x=element_blank())

ggplotly(p, tooltip = c("x" ,"y"))
Admitted with mention AB

p <- dnb_results %>% 
 select(admitted_AB, region) %>% 
 group_by(region) %>% 
 summarise(AB = sum(admitted_AB)) %>% 
 ggplot(aes(x = region, 
            y = AB, 
            fill = region)) + 
        geom_col() +
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with mention AB") +
        ylab("Number of students")+
        theme(axis.title.x=element_blank(),
              axis.text.x=element_blank(),
              axis.ticks.x=element_blank())

ggplotly(p, tooltip = c("x" ,"y"))
Admitted with mention B

p <- dnb_results %>% 
 select(admitted_B, region) %>% 
 group_by(region) %>% 
 summarise(B = sum(admitted_B)) %>% 
 ggplot(aes(x = region, 
            y = B, 
            fill = region)) + 
        geom_col() +
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with mention B") +
        ylab("Number of students")+
        theme(axis.title.x=element_blank(),
              axis.text.x=element_blank(),
              axis.ticks.x=element_blank())

ggplotly(p, tooltip = c("x" ,"y"))
Admitted with mention TB

p <- dnb_results %>% 
 select(admitted_TB, region) %>% 
 group_by(region) %>% 
 summarise(TB = sum(admitted_TB)) %>% 
 ggplot(aes(x = region, 
            y = TB, 
            fill = region)) + 
        geom_col() +
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with mention TB") +
        ylab("Number of students")+
        theme(axis.title.x=element_blank(),
              axis.text.x=element_blank(),
              axis.ticks.x=element_blank())

ggplotly(p, tooltip = c("x" ,"y"))

Based on the chart, Ile de France has recorded the highest number of admissions since 2006, with 1 914 458 students admitted in this region so far. Auvergne-Rhône-Alpes came second, with a total of 1 027 655 admissions, and Hauts-de-France third, with 990 610 admissions since 2006. Grand Est, Nouvelle-Aquitaine and Occitanie have performed similarly over the years, each recording around 850 000 admissions over the period [2006-2021]. On the contrary, Corse has recorded the lowest number of admissions: only 41 541 students were admitted in this region over the same interval.

Taking the distinction into consideration, we notice that the regions with the highest number of admissions also lead in every distinction. As a matter of fact, Ile de France records the highest number of admissions with highest honour (TB), high honours (B), honours (AB) and a standard pass (without mention).

The results are not surprising. Indeed, according to population statistics, Ile de France is the most populated region, with approximately 12 262 544 inhabitants. We therefore expected it to take the lead in terms of the number of admissions.

Period 2006-2021 by region

In order to take the time component into account, a line chart has been created: we used the group_by function to group the success rate data by session and region. We relied on this chart to analyze the evolution of the success rate throughout the years for each region. It also enabled us to take the admission distinctions into consideration and thus analyze the development of students’ performance in all French regions.

Success rate

Note that, for better visual interpretation, the scale of the y-axis is adapted to each graph.


 p <- dnb_results %>% 
 select(success_rate_pct, region, session) %>% 
 group_by(region, session) %>% 
 summarise(success_rate = mean(success_rate_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = session, 
            y = success_rate, 
            color = region,
            text = region)) + 
        geom_line() +
        scale_color_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Success rate") +
        ylab("Rate in %")

ggplotly(p, tooltip = c("text","x" ,"y" ))
Admitted with zero mention



 p <- dnb_results %>% 
 select(without_pct, region, session) %>% 
 group_by(region, session) %>% 
 summarise(zero = mean(without_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = session, 
            y = zero, 
            color = region,
            text = region)) + 
        geom_line() +
        scale_color_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with zero mention") +
        ylab("Rate in %")

ggplotly(p, tooltip = c("text","x" ,"y" ))
Admitted with mention AB



 p <- dnb_results %>% 
 select(AB_pct, region, session) %>% 
 group_by(region, session) %>% 
 summarise(AB = mean(AB_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = session, 
            y = AB, 
            color = region,
            text = region)) + 
        geom_line() +
        scale_color_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with mention AB") +
        ylab("Rate in %")

ggplotly(p, tooltip = c("text","x" ,"y" ))
Admitted with mention B



 p <- dnb_results %>% 
 select(B_pct, region, session) %>% 
 group_by(region, session) %>% 
 summarise(B = mean(B_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = session, 
            y = B, 
            color = region,
            text = region)) + 
        geom_line() +
        scale_color_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with mention B") +
        ylab("Rate in %")

ggplotly(p, tooltip = c("text","x" ,"y" ))
Admitted with mention TB



 p <- dnb_results %>% 
 select(TB_pct, region, session) %>% 
 group_by(region, session) %>% 
 summarise(TB = mean(TB_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = session, 
            y = TB, 
            color = region,
            text = region)) + 
        geom_line() +
        scale_color_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with mention TB") +
        ylab("Rate in %")

ggplotly(p, tooltip = c("text","x" ,"y" ))

The standard pass admission rate follows a negative pattern: it has been decreasing since 2006 in all regions. It dropped from 57.7% in 2006 to 24.9% in 2021 for Ile de France, from 49.6% to 24.4% for Nouvelle-Aquitaine, and from 53.4% to 22.6% for Auvergne-Rhône-Alpes.

In 2017 and 2020, the standard pass rate as well as the rate of admissions with honours dropped remarkably, reaching their lowest point in 2020 in all regions. This tendency was disrupted in 2019, when these rates increased noticeably. In 2021, they rose once more, reflecting a deterioration of students’ performance once again.

Conversely, the number of students succeeding with a great distinction (highest and high honours) has been increasing in all French regions since 2006, reaching its highest point in 2020. Here as well, we notice a strong negative evolution in 2019, which reappeared in 2021.

4.1.3 Departmental Analysis

Following our national and regional analysis, we wanted to narrow the scope further in order to gain a better understanding of our data. Hence, we processed our data at the departmental level. To do so, we created the following box plots based on the dnb_results dataset.

4.1.3.1 Box plot Analysis per department

The first step of this analysis is to visualize the performance of all French departments over the period [2006-2021]. To do so, we grouped our data by department and session, then plotted the success rate of each French department over the period [2006-2021] in the box plot below.


dnb_pct_dep <- dnb_results %>%
  group_by(department, session) %>% 
  summarise(AB_pct_dep = mean(AB_pct, na.rm = TRUE),
            B_pct_dep = mean(B_pct, na.rm = TRUE),
            TB_pct_dep = mean(TB_pct, na.rm = TRUE),
            without_pct_dep = mean(without_pct, na.rm = TRUE),
            success_rate_pct_dep = mean(success_rate_pct, na.rm = TRUE))
success_rate_pct

Note that, for better visual interpretation, the scale of the y-axis is adapted to each graph.


p <- dnb_pct_dep %>% 
  ggplot(aes(x = session, 
             y = success_rate_pct_dep, 
             group = session, 
             fill = session, 
             text = department,
             text2 = success_rate_pct_dep)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+ 
        scale_fill_gradientn(colors = viridis(16))+
        guides(fill = "none")+
        labs( x= "", y = "Success rate in %",
                title ="Success rate of each Department by session")

ggplotly(p, tooltip = c("text","text2"))
without_pct



p <- dnb_pct_dep %>% 
  ggplot(aes(x = session, 
             y = without_pct_dep, 
             group = session, 
             fill = session, 
             text = department,
             text2 = without_pct_dep)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+ 
        scale_fill_gradientn(colors = viridis(16))+
        guides(fill = "none")+
        labs( x= "", 
              y = "Rate of students with no mention in %",
              title ="Rate of students with no mention of each Department by session")

ggplotly(p, tooltip = c("text","text2"))
B_pct



p <- dnb_pct_dep %>% 
  ggplot(aes(x = session, 
             y = B_pct_dep, 
             group = session, 
             fill = session, 
             text = department,
             text2 = B_pct_dep)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+ 
        scale_fill_gradientn(colors = viridis(16))+
        guides(fill = "none")+
        labs( x= "", 
              y = "Rate of students with mention Bien in %",
              title ="Rate of students with mention Bien of each Department by session")

ggplotly(p, tooltip = c("text","text2"))
AB_pct



p <- dnb_pct_dep %>% 
  ggplot(aes(x = session, 
             y = AB_pct_dep, 
             group = session, 
             fill = session, 
             text = department,
             text2 = AB_pct_dep)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+ 
        scale_fill_gradientn(colors = viridis(16))+
        guides(fill = "none")+
        labs( x= "", 
              y = "Rate of students with mention Assez Bien in %",
              title ="Rate of students with mention Assez Bien of each Department by session")

ggplotly(p, tooltip = c("text","text2"))
TB_pct



p <- dnb_pct_dep %>% 
  ggplot(aes(x = session, 
             y = TB_pct_dep, 
             group = session, 
             fill = session, 
             text = department,
             text2 = TB_pct_dep)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+ 
        scale_fill_gradientn(colors = viridis(16))+
        guides(fill = "none")+
        labs( x= "", 
              y = "Rate of students with mention Très Bien in %",
              title ="Rate of students with mention Très Bien of each Department by session")

ggplotly(p, tooltip = c("text","text2"))

Based on our data analysis, the success rate follows a positive pattern: it has been increasing since 2006. We also notice an improvement of the success rate in 2017 as well as a remarkable decline in 2019. A boost in students’ performance was recorded in 2021, when the success rate in all departments reached its highest level since 2006.

Also, the number of students succeeding with honours or a standard pass has been decreasing since 2006 in all French departments. It decreased remarkably in 2017 compared to the previous years, reaching its lowest level in 2020. This negative pattern reversed in 2021, when the number of students graduating with honours or a standard pass started to increase again.

In the same context, the number of students graduating with highest or high honours has been increasing since 2006 in all French departments. It rose particularly sharply from 2017 onwards, reaching its highest point in 2020, which reflects a noticeable improvement in students’ performance.

Box plot Analysis at the Department level

The following box plots were created to compare the performance of the best and “worst” departments. They allowed us to visualize the difference between the highest achieving department in 2020, Paris, and one of the least achieving ones, Eure et Loir. We chose to display the results of each establishment in order to see the dispersion of results within one department.

We plotted the different success rates of the Paris department for the years 2006 and 2020 using the filter function, and proceeded the same way for the Eure et Loir department over the same period.

Paris 2020

Paris <- dnb_results %>%
  select(school_id,establishment_name,department,session, contains("pct")) %>% 
  filter(department == "PARIS", session == "2020") %>% 
  pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
               names_to = "Mention_type",
               values_to = "Rate")
Paris$Mention_type <- factor(Paris$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph

p <- Paris %>% 
 ggplot(aes(x = Mention_type, 
             y = Rate, 
             fill = Mention_type, 
             text = establishment_name,
             text2 = Rate)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+
        guides(fill = "none")+
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        labs( x= "", 
              y = "Rate in %",
              title ="Results for Parisian establishments in 2020")

ggplotly(p, tooltip = c("text","text2"))
Paris 2006

Paris <- dnb_results %>%
  select(school_id,establishment_name,department,session, contains("pct")) %>% 
  filter(department == "PARIS", session == "2006") %>% 
  pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
               names_to = "Mention_type",
               values_to = "Rate")
Paris$Mention_type <- factor(Paris$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph

p <- Paris %>% 
 ggplot(aes(x = Mention_type, 
             y = Rate, 
             fill = Mention_type, 
             text = establishment_name,
             text2 = Rate)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+
        guides(fill = "none")+
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        labs( x= "", 
              y = "Rate in %",
              title ="Results for Parisian establishments in 2006")

ggplotly(p, tooltip = c("text","text2"))
Eure et Loir 2020

Eure <- dnb_results %>%
  select(school_id,establishment_name,department,session, contains("pct")) %>% 
  dplyr::filter(department == "EURE-ET-LOIR", session == "2020") %>% 
  pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
               names_to = "Mention_type",
               values_to = "Rate")
Eure$Mention_type <- factor(Eure$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph

 p <- Eure %>% 
 ggplot(aes(x = Mention_type, 
             y = Rate, 
             fill = Mention_type, 
             text = establishment_name,
             text2 = Rate)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+
        guides(fill = "none")+
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        labs( x= "", 
              y = "Rate in %",
              title ="Results for Eure et Loir establishments in 2020")

ggplotly(p, tooltip = c("text","text2"))
Eure et Loir 2006

Eure <- dnb_results %>%
  select(school_id,establishment_name,department,session, contains("pct")) %>% 
  dplyr::filter(department == "EURE-ET-LOIR", session == "2006") %>% 
  pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
               names_to = "Mention_type",
               values_to = "Rate")
Eure$Mention_type <- factor(Eure$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph

 p <- Eure %>% 
 ggplot(aes(x = Mention_type, 
             y = Rate, 
             fill = Mention_type, 
             text = establishment_name,
             text2 = Rate)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+
        guides(fill = "none")+
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        labs( x= "", 
              y = "Rate in %",
              title ="Results for Eure et Loir establishments in 2006")

ggplotly(p, tooltip = c("text","text2"))

Paris

On average, the success rate across all Paris establishments reached 93% in 2020, the highest performance recorded since 2006. Accordingly, the number of highest honour admissions outpaced the number of admissions with the other distinctions.

Conversely, the success rate was much lower in 2006: it ranged roughly between 40% and 60%, compared with 66% to 100% in 2020. In 2006, the number of students succeeding with a standard pass was the highest, while admissions with highest honour represented the smallest share.

Eure et Loir

In 2020, the results of the Eure et Loir department followed the same pattern as those of Paris: the average success rate peaked in 2020 at 87.90%, compared to only 81% in 2006. Similarly, in 2020 the number of admissions with highest honour outpaced the number of successes with the other distinctions.

However, in 2006, the number of students graduating with a standard pass outstripped that of students graduating with any other distinction.

Overall, in 2020, the Paris department outperformed Eure et Loir, with an average success rate of 92.40% compared to only 87.90%. Likewise, the rate of admissions with highest honour in Paris largely outstripped that of Eure et Loir: it ranged roughly between 29.17% and 55.21% in Paris, against 14.52% to 33.53% in Eure et Loir. The standard pass rate was remarkably higher in Eure et Loir, representing 24.44% of admissions in 2020, compared to only 12.97% in Paris. Similarly, Eure et Loir recorded an average rate of admissions with honours of 22.86%, compared to only 17.59% in Paris. Finally, the dispersion pattern across establishments is rather similar between the two departments but changes between 2006 and 2020.
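The 2020 gap between the two departments can be summarised with the averages quoted above (a base-R sketch; values in %):

```r
# Average 2020 rates quoted above, in %
paris <- c(success = 92.40, without = 12.97, AB = 17.59)
eure  <- c(success = 87.90, without = 24.44, AB = 22.86)

# Positive values favour Paris, negative values favour Eure et Loir
round(paris - eure, 2)
```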

4.2 Establishment_24


p <- ggplot() +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") +
  geom_point(data = establishment_24, aes(x = longitude , y = latitude, text = department_fr, text2 = establishment),  size = 0.5)+
  coord_map() +
  labs(x = "", 
       y = "", 
       title = 'Establishment labelled "Génération 2024"') +
  map_theme

ggplotly(p, tooltip = c("text", "text2"))

Newly labelled “Génération 2024” establishments per department and year

establishment_24_dep_map <- left_join(x = map[,-6], y = establishment_24_dep, by = "department_fr")
2017


p <- establishment_24_dep_map  %>% 
  filter(session_started == 2017) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
  geom_polygon(aes(fill= establishment), color = "black") +
  scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
  labs(x = "", 
       y = "", 
       title = " New establishment 2024 labellization in 2017")+
  map_theme


ggplotly(p, tooltip = c("text","fill"))
2018


p <-establishment_24_dep_map  %>% 
  filter(session_started == 2018) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
  geom_polygon(aes(fill= establishment), color = "black") +
  scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
  labs(x = "", 
       y = "", 
       title = " New establishment 2024 labellization in 2018")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2019


p <-establishment_24_dep_map  %>% 
  filter(session_started == 2019) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
  geom_polygon(aes(fill= establishment), color = "black") +
  scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
  labs(x = "", 
       y = "", 
       title = " New establishment 2024 labellization in 2019")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2020


p <-establishment_24_dep_map  %>% 
  filter(session_started == 2020) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
  geom_polygon(aes(fill= establishment), color = "black") +
  scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
  labs(x = "", 
       y = "", 
       title = " New establishment 2024 labellization in 2020")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2021


p <-establishment_24_dep_map  %>% 
  filter(session_started == 2021) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
  geom_polygon(aes(fill= establishment), color = "black") +
  scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
  labs(x = "", 
       y = "", 
       title = " New establishment 2024 labellization in 2021")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2022


p <-establishment_24_dep_map  %>% 
  filter(session_started == 2022) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
  geom_polygon(aes(fill= establishment), color = "black") +
  scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
  labs(x = "", 
       y = "", 
       title = " New establishment 2024 labellization in 2022")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))

# We first count the number of establishments per department
nb_est <- dnb_results %>% 
  select(school_id, department_fr) %>% 
  group_by(department_fr) %>% 
  summarise(n_distinct(school_id))
# We join these counts with the values from establishment_24 
establishment_24_dnb_dep <- left_join(x = nb_est, y = establishment_24_dep)

# We compute the rate of labelled establishments
establishment_24_dnb_dep <- establishment_24_dnb_dep %>% 
  group_by(department_fr) %>% 
  summarise(establishment = sum(establishment),
            nb_est = mean(`n_distinct(school_id)`)) %>% # use mean to keep the value of n_distinct(school_id) 
  mutate(rate = establishment/nb_est*100)

establishment_24_dnb_dep_map <- left_join(x = map[,-6], y = establishment_24_dnb_dep, by = "department_fr")

p <-establishment_24_dnb_dep_map  %>% 
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= rate), color = "black") +
  scale_fill_viridis(name = "Rate of labelled establishment in % ") +
  labs(x = "", 
       y = "", 
       title = "Establishment 2024 in 2022")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))

4.3 student_housing

p <- student_housing %>% 
  select(session, external_students_secondary_education, half_boarders_students_secondary_education, boarding_students_secondary_education) %>% 
  group_by(session) %>% 
  summarise(external_students_secondary_education = sum(external_students_secondary_education, na.rm = TRUE),
            half_boarders_students_secondary_education = sum(half_boarders_students_secondary_education, na.rm = TRUE), 
            boarding_students_secondary_education = sum(boarding_students_secondary_education, na.rm = TRUE)) %>% 
  gather("Housing_option", "students", 2:4) %>%  
  ggplot(aes(x=session, 
             y=students,
             fill=Housing_option)) +
    geom_bar(stat="identity", position=position_dodge())+
    ggtitle("Students per housing option in France")+
    xlab("Session")+
    ylab("Number of students")+
    scale_fill_viridis(discrete = TRUE)+
    theme_ipsum()

ggplotly(p, tooltip = "y")

Housing option rate in secondary education per department

external students

p <- housing_dep %>% 
  ggplot(aes(x = as.factor(session), 
             y = external_students_rate, 
             group = session, 
             fill = session, 
             text = department_fr,
             text2 = external_students_rate)) +
  geom_boxplot()+
  geom_jitter(width = 0.25, alpha = 0.5)+ 
  scale_fill_viridis()+
  guides(fill = "none")+
  labs( x= "",
        y = "Rate of external students in %",
        title ="Rate of external students of each Department by session")+
  theme_ipsum()

ggplotly(p, tooltip = c("text","text2"))
half-boarders students

p <- housing_dep %>% 
  ggplot(aes(x = as.factor(session), 
             y = half_boarders_students_rate, 
             group = session, 
             fill = session, 
             text = department_fr,
             text2 = half_boarders_students_rate)) +
  geom_boxplot()+
  geom_jitter(width = 0.25, alpha = 0.5)+ 
  scale_fill_viridis()+
  guides(fill = "none")+
  labs( x= "",
        y = "Rate of half-boarders students in %",
        title ="Rate of half-boarders  students of each Department by session")+
  theme_ipsum()

ggplotly(p, tooltip = c("text","text2"))
boarding students

p <- housing_dep %>% 
  ggplot(aes(x = as.factor(session), 
             y = boarding_students_rate, 
             group = session, 
             fill = session, 
             text = department_fr,
             text2 = boarding_students_rate)) +
  geom_boxplot()+
  geom_jitter(width = 0.25, alpha = 0.5)+ 
  scale_fill_viridis()+
  guides(fill = "none")+
  labs( x= "",
        y = "Rate of boarding students in %",
        title ="Rate of boarding students of each Department by session")+
  theme_ipsum()

ggplotly(p, tooltip = c("text","text2"))

4.4 single_parent

The first visualisation we wanted to create was a bar plot showing the evolution of the total number of single-parent families in France from 2007 to 2018. To do this, we isolated the sing_par and session variables in a dataset called sp1_, then summarised the number of single-parent families with sum. To remove the years that did not interest us, we used the filter function. We were then able to use geom_col.

sp1_ <- single_parent %>%
  select(c("session", "sing_par")) %>%
  group_by(session)%>%
  summarise (sing_par = sum(sing_par, na.rm = TRUE))


p <- ggplot(data = sp1_, aes(x = as.factor(session), 
                             y = sing_par, 
                             fill = as.factor (session),
                             text = sing_par))+
  geom_col()+
  scale_fill_viridis(discrete = TRUE, "Session") +
  labs(x = "", y = "Number of single parent families")+
  theme_ipsum()

ggplotly(p, tooltip = "text")

Based on the bar plot, we can make an overall analysis of the evolution of single-parent families in France over the years. Although several years are missing, we can clearly see that the number of single-parent families has kept increasing since 2007: there were 2,427,110 single-parent families in 2007, compared to 3,031,823 in 2018. In just over 10 years, this represents an increase of roughly 25%.
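The growth over the period can be computed directly from the two counts quoted above:

```r
# Single-parent families in France (figures quoted above)
sp_2007 <- 2427110
sp_2018 <- 3031823

# Percentage increase between 2007 and 2018
round((sp_2018 - sp_2007) / sp_2007 * 100, 1)  # 24.9
```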

Single-parent families by departments, 2007

We need to create the data set used for the maps. For this, we join the single_parent and map data sets by department_fr.

jmap_sp<- left_join(x = map[,-6], y = single_parent, by = "department_fr")
p <-jmap_sp %>% 
  filter(session == 2007) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= sing_par), color = "black") +
  scale_fill_viridis(name = "Number of single parent families") +
  labs(x = "", 
       y = "", 
       title = "Single parent families in 2007")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))

From the map, we realise that the departments with the most single-parent families are those with the most inhabitants (Bien dans ma ville, 2022). This unsurprising result motivated our choice to create normalised variables so that the data become easily comparable.

To do this, we first join single_parent and dnb_results_dep by department_fr and session to get the number of students per department. We then use the mutate function to create a new variable dividing sing_par by the number of students admitted in each department. Thanks to this, we were able to create maps of France showing the number of single-parent families per student admitted from 2007 to 2018. To make the maps interactive, we used the ggplotly function.

sing_dnb <- left_join(x = single_parent, y = dnb_results_dep, by = c("department_fr", "session"))

sing_dnb <- sing_dnb %>% 
  mutate(single_parent_per_student_admitted = sing_par/admitted)
#Join with the map data set for the mapping. 
sing_dnb_map <- left_join(x = map[,-6], y = sing_dnb, by = "department_fr")

Single parent families per student admitted

2007


p <-sing_dnb_map %>% 
  filter(session == 2007) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
  scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
  labs(x = "", 
       y = "", 
       title = "Single parent families per student admitted in 2007")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2008


p <-sing_dnb_map %>% 
  filter(session == 2008) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
  scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
  labs(x = "", 
       y = "", 
       title = "Single parent families per student in 2008")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2012


p <-sing_dnb_map %>% 
  filter(session == 2012) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
  scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
  labs(x = "", 
       y = "", 
       title = "Single parent families per student in 2012")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2013


p <-sing_dnb_map %>% 
  filter(session == 2013) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
  scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
  labs(x = "", 
       y = "", 
       title = "Single parent families per student in 2013")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2017


p <-sing_dnb_map %>% 
  filter(session == 2017) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
  scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
  labs(x = "", 
       y = "", 
       title = "Single parent families per student in 2017")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2018


p <-sing_dnb_map %>% 
  filter(session == 2018) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
  scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
  labs(x = "", 
       y = "", 
       title = "Single parent families per student in 2018")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))

It can be noted that the departments in the West have the lowest single-parent rates (Mayenne, Maine-et-Loire, Vendée), while the departments in the Île-de-France region (Paris, Seine-Saint-Denis, Marne), those around the Mediterranean Sea (Bouches-du-Rhône, Pyrénées, Hérault) and those in Corsica (Haute-Corse, Corse-du-Sud) have the highest rates. It will therefore be important to keep an eye on these departments when analysing the results of the Diplôme National du Brevet (DNB).

4.5 covid_in_schools

First, we want to get an overview of COVID positive cases in France over the covid period. We select from covid_in_school the variables positive and test_date, group by test_date and summarise positive with sum. We then use ggplot with test_date as the x-axis and positive for the y-axis.

p <- covid_in_schools %>% 
  select(positive, test_date) %>% 
  group_by(test_date) %>% 
  summarise(positive = sum(positive)) %>% 
  ggplot( mapping = aes(x= test_date, y = positive)) +
  geom_line() +
  labs(title = "French  Covid-19 cases for the age group 11 to 15 years old  (2020-2022)", x = "Date", y = "Number of cases")+
  theme_ipsum()

ggplotly(p, tooltip = c("x","y"))

From 2020 until the end of 2021, we notice waves of covid cases: roughly every 6 months, the number of positive cases rises to nearly 20,000 per day and then drops to around 1,000 per day. However, we observe a much sharper increase, with a peak of more than 235,000 positive cases per day, in January 2022. After this peak, the number of positive cases fell significantly and remained around 2,000-10,000 per day. The 2022 academic year is therefore a year to keep an eye on when comparing with the DNB results.

For the second exploratory analysis, we want to compare positive covid cases across regions. As the population differs between departments, it is easier to compare using incidence_rate. To create the plot, we select from the covid_in_schools data set the variables department_code, incidence_rate, session, region, department_fr, test_date and positive, group by region and test_date, and summarise incidence_rate with its mean and positive with its sum. We also use ggplotly with tooltip to display the important information.

p <- covid_in_schools %>%
  select(c("department_code", "incidence_rate", "session", "region", "department_fr", "test_date", "positive")) %>%
  group_by(region, test_date) %>%
  summarise(incidence_rate = mean(incidence_rate, na.rm = TRUE), 
            positive = sum(positive)) %>% 
  ggplot() +
  geom_line(mapping = aes(x = test_date, y = incidence_rate, color = region))+
   scale_color_viridis(discrete = TRUE) +
    labs(title = "Covid-19 incidence rate for the age group 11 to 15 years old by region (2020-2022)", x = "Date", y = "Incidence rate") +
    theme_ipsum()

ggplotly(p, tooltip = c("x","y", "color"))

For the age group 11 to 15, Île-de-France and Auvergne-Rhône-Alpes seem to be the two regions most affected during the first waves. Covid-19 did not spare any region, as the incidence rates closely follow each other. The only slight exception is the wave of March-April 2022, during which a few regions such as Corse and Bretagne were more affected.

Incidence rate average per session

The last visualization for the covid data set is a map of the average incidence rate by department, from session 2020 to session 2023.

covidpos_dep <- covid_in_schools %>% 
  select(c("department_code", "incidence_rate", "session", "department_fr" )) %>% 
  group_by(department_fr, session) %>%
  summarise(incidence_rate = mean(incidence_rate, na.rm = TRUE))

covidpos_dep <- left_join(x = map[,-6], y = covidpos_dep, by = "department_fr")
2020

Note that, for a better visual interpretation, the colour scale is adapted to each map.


p <- covidpos_dep %>% 
  filter(session == 2020) %>% 
  ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
  geom_polygon(aes(fill= incidence_rate), color = "black") +
  coord_map()+
  scale_fill_viridis(name = "Incidence rate average in 2020")+
  map_theme

ggplotly(p)
2021



p <- covidpos_dep %>% 
  filter(session == 2021) %>% 
  ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
  geom_polygon(aes(fill= incidence_rate), color = "black") +
  coord_map()+
  scale_fill_viridis(name = "Incidence rate average in 2021")+
  map_theme

ggplotly(p)
2022



p <- covidpos_dep %>% 
  filter(session == 2022) %>% 
  ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
  geom_polygon(aes(fill= incidence_rate), color = "black") +
  coord_map()+
  scale_fill_viridis(name = "Incidence rate average in 2022")+
  map_theme

ggplotly(p)
2023



p <- covidpos_dep %>% 
  filter(session == 2023) %>% 
  ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
  geom_polygon(aes(fill= incidence_rate), color = "black") +
  coord_map()+
  scale_fill_viridis(name = "Incidence rate average in 2023")+
  map_theme

ggplotly(p)

During the 2020 session, the average incidence rate is relatively low in all departments of France, except in Mayenne where the incidence rate is 17.96, which is the highest rate.

During the 2021 session, the scale indicates a clear increase in the average incidence rate in every department, ranging from 52.6 to 246.9 cases per department. The western part of France is much more affected than the eastern part. We can hypothesise that proximity to borders and other countries plays a role in these differences.

During the 2022 session, we can observe an overall increase in the incidence rate in each of the departments. Aveyron, Landes, Pyrénées-Atlantiques and Corse du Sud are the departments with the highest rates, with a maximum of 1201.

During the 2023 session, the incidence rate fell significantly, particularly in Île-de-France. This is because the school year is not over yet. Creuse is the department with the highest rate, at 420.

5 Analysis

5.1 What is the evolution of student performance over time and across the different regions/departments of France?


p <- France_results %>% 
  ggplot(aes(x = session, y = Number_of_students, group = Candidates, color = Candidates))+
  geom_line()+
  scale_color_viridis(discrete = TRUE) +
    ggtitle("National DNB statistics") +
    theme_ipsum() +
    ylab("Number of students")

ggplotly(p, tooltip = c("x" ,"y"))


5.2 Do socio-economic factors such as the type of accommodation, family situation or college policies have an influence on student success ?

5.2.1 establishment_24

We first mapped the establishments labelled Generation 2024 onto a map of the average success rate for the brevet over the period 2006-2021. This visualisation helped us determine whether a clear link could be drawn or whether further analysis was needed. To create the map, we first had to calculate the average success rate and join the result with the map data set.


result <- dnb_results %>% 
  select(department_fr, success_rate_pct) %>% 
  group_by(department_fr) %>% 
  summarise(success_rate = mean(success_rate_pct, na.rm = TRUE))


result_map <- left_join(x = map[,-6], y = result, by = "department_fr")

Then, with the help of ggplot, we superimposed the map of the average success rate and the dotted map of the locations of the labelled establishments presented in chapter 4.2.


p <-   ggplot() +
  geom_polygon(data = result_map, aes(long,lat, 
                            group = group, 
                            fill = success_rate, 
                            text = department_fr)) +
  geom_point(data = establishment_24, aes(x = longitude, y = latitude, text2= establishment), size = 0.5)+
  coord_map() +
  scale_fill_viridis(name = "Average sucess rate, 2006-2021")+
  labs(x = "", 
       y = "", 
       title = "Average success rate vs labelled Establishemnt") +
  map_theme
  
ggplotly(p, tooltip = c("text2", "text") ) #due to a strange behavior of ggplotly we must add the "text" variable in the tooltip otherwise the success rate map is only partially displayed. 

Zooming in on the map, one can see no clear pattern of correlation between high-achieving departments and a great number of labelled establishments. For example, there are many labelled establishments around Paris, yet the success rate there varies greatly. Building on this, we do not expect a strong relation between the success rate and whether an establishment is labelled.

To test this statement, we did a linear regression at the establishment level. We used the Generation 2024 label as a boolean, with 1 for labelled establishments and 0 for unlabelled ones. To create this variable, we first had to add the school_id and session_started variables from establishment_24 to dnb_results. We used left_join by the variable school_id to keep all the establishments present in dnb_results. To add the variable est_24, we used mutate coupled with case_when to cover all cases: est_24 takes the value 1 when the session is greater than or equal to session_started, and the value 0 in all other cases, i.e. when session is smaller than session_started, or when the establishment is not labelled and session_started is NA.

est_24_join <- establishment_24 %>% 
  select(school_id, session_started)

est_24_dnb <- left_join( x = dnb_results, y = est_24_join, by = c("school_id"))

est_24_dnb <- est_24_dnb %>% 
  filter(session >= 2017) %>% 
  mutate(est_24 = case_when(session >= session_started ~ 1,
                            session <  session_started ~ 0, 
                            is.na(session_started) == TRUE ~ 0 ))

To have a visual representation of the possible relation between the success rate and the labelling, we used geom_bin_2d with est_24 as a factor on the x-axis and success_rate_pct on the y-axis. We chose this type of plot to better display the dispersion of the establishments along the y-axis. To run the linear regression, we used the lm function.

ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = success_rate_pct))+
  labs(x = "est_24")+
  theme_ipsum()+
  scale_fill_viridis("Number of establishment")

lm_est_24_dnb <- lm(data = est_24_dnb, success_rate_pct ~ est_24)
tab_model(lm_est_24_dnb)
  success_rate_pct
Predictors Estimates CI p
(Intercept) 88.09 88.00 – 88.19 <0.001
est 24 -0.60 -1.00 – -0.19 0.004
Observations 42472
R2 / R2 adjusted 0.000 / 0.000

The result of this linear regression suggests a potential negative effect of holding the Generation 2024 label. However, with an R squared of essentially 0.000, the analysis is highly unreliable, as the label would explain less than 0.01% of the variation in success rate.

We decided to continue our analysis and performed a linear regression for each distinction, following the same steps as for the success rate regression analysis.

Linear regression for each distinction

Without
ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = without_pct))+
  labs(x = "est_24")+
  theme_ipsum()+
  scale_fill_viridis("Number of establishment")

lm_est_24_dnb <- lm(data = est_24_dnb, without_pct ~ est_24)
tab_model(lm_est_24_dnb)
  without_pct
Predictors Estimates CI p
(Intercept) 25.44 25.31 – 25.57 <0.001
est 24 -1.21 -1.77 – -0.65 <0.001
Observations 42458
R2 / R2 adjusted 0.000 / 0.000

AB
ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = AB_pct))+
  labs(x = "est_24")+
  theme_ipsum()+
  scale_fill_viridis("Number of establishment")

lm_est_24_dnb <- lm(data = est_24_dnb, AB_pct ~ est_24)
tab_model(lm_est_24_dnb)
  AB_pct
Predictors Estimates CI p
(Intercept) 25.68 25.58 – 25.77 <0.001
est 24 -2.30 -2.70 – -1.89 <0.001
Observations 42458
R2 / R2 adjusted 0.003 / 0.003

B
ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = B_pct))+
  labs(x = "est_24")+
  theme_ipsum()+
  scale_fill_viridis("Number of establishment")

lm_est_24_dnb <- lm(data = est_24_dnb, B_pct ~ est_24)
tab_model(lm_est_24_dnb)
  B_pct
Predictors Estimates CI p
(Intercept) 25.27 25.18 – 25.36 <0.001
est 24 -0.43 -0.81 – -0.05 0.027
Observations 42458
R2 / R2 adjusted 0.000 / 0.000

TB
ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = TB_pct))+
  labs(x = "est_24")+
  theme_ipsum()+
  scale_fill_viridis("Number of establishment")

lm_est_24_dnb <- lm(data = est_24_dnb, TB_pct ~ est_24)
tab_model(lm_est_24_dnb)
  TB_pct
Predictors Estimates CI p
(Intercept) 23.61 23.47 – 23.75 <0.001
est 24 3.94 3.34 – 4.53 <0.001
Observations 42458
R2 / R2 adjusted 0.004 / 0.004

The R squared is unsurprisingly very low again, meaning that we cannot conclude any strong relation between students' achievements and the Generation 2024 label. It should still be mentioned that the label is associated positively with the TB rate and negatively with the others.

From our analysis, we can conclude that we do not see any pattern suggesting that the Generation 2024 label has an influence on the results of the establishments. This is not that surprising, as the objective of the label is really to promote the 2024 Olympics in Paris and to link the school world with the sports world. It sets no requirement on the amount of sport practised during school time. The total amount of sport practised by students is therefore probably not much greater in a labelled institution than in a non-labelled one, hence the little influence.

5.2.2 student_housing

To analyse the impact of the housing offerings of the establishments on the results of their students, we first need to join the student_housing data set with dnb_results. We filtered out the unnecessary or repetitive variables of dnb_results, such as the academy name and code or the education sector. We used inner_join by session, school_id and department_fr to keep only the establishments appearing in student_housing.

dnb_prep_housing <- dnb_results %>% 
  select(session, school_id, registered:TB_pct)
housing_dnb <- inner_join(x = student_housing , y = dnb_prep_housing, by = c("session", "school_id", "department_fr"))

It should be noted that the numbers in students_secondary_education and registered do not match, as students of all four years of college are accounted for in students_secondary_education.
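A hypothetical numeric illustration of this mismatch (made-up cohort sizes; the real data differ):

```r
# Illustrative (made-up) cohort sizes for one college: students_secondary_education
# counts all four cohorts (6e to 3e), while `registered` only counts the 3e cohort
# sitting the DNB, so the two totals cannot match.
cohort_sizes <- c("6e" = 120, "5e" = 115, "4e" = 110, "3e" = 105)
students_secondary_education <- sum(cohort_sizes)  # 450
registered <- cohort_sizes[["3e"]]                 # 105
```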

5.2.2.1 Linear regression

We evaluated the possibility of doing a multiple linear regression by measuring the correlation of the external_students_rate, half_boarders_students_rate and boarding_students_rate variables.


corrplot(cor(housing_dnb[c(17:19)]), col = viridis(256)) 

The correlation is unsurprisingly very high: if, for example, an establishment offers meals, most students will take them and very few will eat at home. Therefore, we will use simple linear regressions to assess whether there is a link between student results and their habitual place of eating and living.

Linear regression for each offering

We simply used the lm function for each offering. We did not include scatter plots in the final report as the geom_point layers were completely overplotted.

External students

lm_housing_dnb <- lm(data = housing_dnb, success_rate_pct ~ external_students_rate)
tab_model(lm_housing_dnb)
  success_rate_pct
Predictors Estimates CI p
(Intercept) 91.92 91.72 – 92.12 <0.001
external students rate -0.10 -0.10 – -0.09 <0.001
Observations 15803
R2 / R2 adjusted 0.070 / 0.070
Half-boarder students

lm_housing_dnb <- lm(data = housing_dnb, success_rate_pct ~ half_boarders_students_rate)
tab_model(lm_housing_dnb)
  success_rate_pct
Predictors Estimates CI p
(Intercept) 82.04 81.64 – 82.43 <0.001
half boarders students rate 0.10 0.10 – 0.11 <0.001
Observations 15803
R2 / R2 adjusted 0.084 / 0.084
Boarding students

lm_housing_dnb <- lm(data = housing_dnb, success_rate_pct ~ boarding_students_rate)
tab_model(lm_housing_dnb)
  success_rate_pct
Predictors Estimates CI p
(Intercept) 89.43 89.28 – 89.57 <0.001
boarding students rate -0.06 -0.07 – -0.05 <0.001
Observations 15803
R2 / R2 adjusted 0.005 / 0.005

The p-value is excellent for each regression but, as in the previous analyses, the R squared is very low. However, we see some interesting variation in the intercepts, a difference of around 10 percentage points, that could prove insightful. To dig deeper into these variations, we performed a cluster analysis.

5.2.2.2 Cluster analysis

We performed the cluster analysis on the sessions 2021 and 2020, and the results are very similar. We decided to keep both analyses, but going through one is sufficient to see the method used and understand the results.

2021

To perform the cluster analysis, we filtered the session 2021 and selected the variables department_fr, session, school_id, external_students_rate, half_boarders_students_rate and boarding_students_rate from the joined data set. We then removed the first three and scaled the rest of the data.
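The construction of the clust object is not shown in the report; a minimal sketch of the steps just described, using a small hypothetical tibble in place of the real joined data set, could look like this:

```r
library(dplyr)

# Hypothetical stand-in for the joined housing_dnb data set (toy values).
housing_dnb_toy <- tibble(
  department_fr = c("Ain", "Aisne", "Allier", "Ardennes"),
  session = c(2021, 2021, 2021, 2020),
  school_id = c("0010001A", "0020002B", "0030003C", "0040004D"),
  external_students_rate = c(60, 20, 45, 30),
  half_boarders_students_rate = c(35, 70, 50, 65),
  boarding_students_rate = c(5, 10, 5, 5)
)

# Filter the 2021 session and keep the six variables listed above...
clust_housing_dnb <- housing_dnb_toy %>%
  filter(session == 2021) %>%
  select(department_fr, session, school_id, external_students_rate,
         half_boarders_students_rate, boarding_students_rate)

# ...then drop the three identifier columns and scale the remaining rates.
clust <- clust_housing_dnb %>%
  select(-c(department_fr, session, school_id)) %>%
  scale()
```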

The goal of this cluster analysis is to define clusters relative to the offering of each establishment. Through this analysis, we aim to split the establishments by offering. As we have 3 different offerings, it would not make sense to create too many clusters. To define the right number of clusters for our k-means clustering, we used the elbow method and concluded that we need 4 clusters. The ratio between the between sum of squares and the total sum of squares is good at 81.5%.
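The 81.5% figure corresponds to the between_SS / total_SS ratio reported in the kmeans printout; a sketch of how to recover it from a fitted object (simulated rate data here, not the real clust matrix):

```r
# Recovering the between_SS / total_SS ratio printed by kmeans(),
# fitted on simulated, scaled rate data (toy values only).
set.seed(123)
toy <- scale(matrix(runif(300), ncol = 3))
km <- kmeans(toy, centers = 4, nstart = 25)
ratio_pct <- 100 * km$betweenss / km$totss  # the "between_SS / total_SS" figure
```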


fviz_nbclust(clust, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) + #add line for better visualisation 
  labs(subtitle = "Elbow method")  #We can determine the optimal number of cluster, 4 clusters seems to be reasonable


# Compute k-means
km.res <- kmeans(clust, 4, nstart = 25)

# Visualize clusters using factoextra
fviz_cluster(km.res, clust,
             ggtheme = theme_ipsum(), 
             repel = TRUE)+
  scale_fill_viridis(discrete = TRUE)+
  scale_color_viridis(discrete = TRUE)

We can see that the four clusters are well defined, with clusters one, two and four being differentiated along the x-axis and cluster three spanning wider on both the x-axis and the y-axis. One can see what the dimensions represent in the graph below. Clusters one and two have higher half-boarder rates than cluster three and especially cluster four. Cluster four represents establishments without any catering offering for the students. Cluster three contains establishments with at least some of their students in boarding school.


fviz_pca_var(PCA(clust, graph = FALSE))

To continue our analysis on the clusters we just computed, we need to bring the cluster assignments back into the main data set. We gathered the cluster data in a new data frame, added the cluster column to the data set used for the clustering, and then joined it with inner_join to the main data set to keep the filter we had applied.


clust1 <- tibble(department_fr = names(km.res$cluster),
                            cluster = km.res$cluster)

clust_housing_dnb$cluster <- clust1$cluster

housing_dnb_2021 <- inner_join(x = housing_dnb, y = clust_housing_dnb)

To analyse the data by cluster, we needed to summarise it by cluster. The results are shown in the table below, with an added variable establishment, which is the number of establishments present in each cluster.

clust_h_dnb_2021 <- housing_dnb_2021 %>% 
  group_by(cluster) %>% 
  summarise(establishment = n(),
            external_students_rate = mean(external_students_rate, na.rm = TRUE),
            half_boarders_students_rate = mean(half_boarders_students_rate, na.rm = TRUE), 
            boarding_students_rate = mean(boarding_students_rate, na.rm = TRUE),
            without_pct = mean(without_pct, na.rm = TRUE),
            AB_pct = mean(AB_pct, na.rm = TRUE),
            B_pct = mean(B_pct, na.rm = TRUE),
            TB_pct = mean(TB_pct, na.rm = TRUE),
            success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>% 
  round(digits = 2)

datatable(clust_h_dnb_2021, options =list(scrollX = "300px"))

We visually represented the results in a spider chart using the radarchart function.

op <- par(mar=c(0, 0, 0, 0))

radar <- clust_h_dnb_2021 %>%
  select(-c(cluster, establishment))

# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
 radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= viridis(4, alpha = 1) , pfcol= viridis(4, alpha = 0.2)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4)#, title = "Radar Graph") 
 # Add a legend
 legend(x=1.2, y=-0.4, legend = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"), bty = "n", pch=20 , col=viridis(4) , text.col = "grey", cex=0.4, pt.cex=1)

par(op)

For better clarity, we have also displayed each cluster individually.

# we need to set the margin and create two rows to display the graphs
op <- par(mar=c(0, 1, 1, 0),mfrow=c(2, 2))

##### Cluster 1

#filter cluster 1 
radar <- clust_h_dnb_2021 %>% 
 filter(cluster == 1)%>%
  select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.267, green = 0.00392, blue = 0.329,  alpha = 0.5) , pfcol= rgb( red = 0.267, green = 0.00392, blue = 0.329,  alpha = 0.2)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 1", cex.main = 1 )

##### Cluster 2
radar <- clust_h_dnb_2021 %>% 
 filter(cluster == 2) %>% 
 select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.192  , green = 0.40784, blue = 0.557,  alpha = 0.5) , pfcol= rgb( red = 0.192    , green = 0.40784, blue = 0.557,  alpha = 0.5)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 2", cex.main = 1 )


##### Cluster 3 
radar <- clust_h_dnb_2021 %>% 
 filter(cluster == 3) %>% 
  select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.208  , green = 0.71765, blue = 0.475,  alpha = 0.5) , pfcol= rgb( red = 0.208    , green = 0.71765, blue = 0.475,  alpha = 0.5)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 3", cex.main = 1 )

##### Cluster 4 
radar <- clust_h_dnb_2021 %>% 
 filter(cluster == 4) %>% 
  select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.992  , green = 0.90588, blue = 0.145,  alpha = 0.5), pfcol= rgb( red = 0.992 , green = 0.90588, blue = 0.145,  alpha = 0.5), plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 4", cex.main = 1  )

par(op)

The cluster analysis shows very interesting results. Our intuition from the regression analysis is confirmed, as the success rate varies between the four clusters. The highest-achieving establishments are the ones offering a canteen for lunch, where the children stay at school with their friends. The lowest success rate is found in the group going home for lunch, where the children are separated and individual realities greatly differ: some students will have a comforting environment at home, whereas others will have a hard time at home or at someone else's home.

The difference is also quite marked for some of the distinctions. The rates for the distinctions AB and B are rather similar across clusters. Students from establishments of the fourth cluster achieve the best results, with 29.3% of them receiving the distinction TB and only 20.8% not receiving any distinction. The rates are much worse if the school has a boarding offer: 35% of students do not receive any distinction and only 10% get the TB distinction. This could be because some of these are establishments for elite athletes, who increasingly focus on their sport.

2020

To perform the cluster analysis, we filtered the session 2020 and selected the variables department_fr, session, school_id, external_students_rate, half_boarders_students_rate and boarding_students_rate from the joined data set. We then removed the first three and scaled the rest of the data.

The goal of this cluster analysis is to define clusters relative to the offering of each establishment. Through this analysis, we aim to split the establishments by offering. As we have 3 different offerings, it would not make sense to create too many clusters. To define the right number of clusters for our k-means clustering, we used the elbow method. For 2020, the decision is not as clear as for 2021, with a hesitation between three and four clusters. As we have four clusters in 2021, we decided to also take four clusters for the 2020 analysis. The ratio between the between sum of squares and the total sum of squares is good at 81.1%.


fviz_nbclust(clust, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) + #add line for better visualisation 
  labs(subtitle = "Elbow method")  #We can determine the optimal number of cluster, 4 clusters seems to be reasonable


# Compute k-means
km.res <- kmeans(clust, 4, nstart = 25)

# Visualize clusters using factoextra
fviz_cluster(km.res, clust,
             ggtheme = theme_ipsum(), 
             repel = TRUE)+
  scale_fill_viridis(discrete = TRUE)+
  scale_color_viridis(discrete = TRUE)

We can see that the four clusters are well defined, with clusters one, three and four being differentiated along the x-axis and cluster two spanning wider on both the x-axis and the y-axis. One can see what the dimensions represent in the graph below. Clusters three and four have higher half-boarder rates than cluster two and especially cluster one. Cluster one represents establishments without any catering offering for the students. Cluster two contains establishments with at least some of their students in boarding school.


fviz_pca_var(PCA(clust, graph = FALSE))

To continue our analysis of the clusters we just computed, we need to add the cluster assignments to the main data set. We gathered the cluster data in a new data frame, added the cluster column to the data set used for the clustering, and then joined it with inner_join to the main data set in order to keep the filter we had applied.


clust1 <- tibble(department_fr = names(km.res$cluster),
                 cluster = km.res$cluster)

clust_housing_dnb$cluster <- clust1$cluster

housing_dnb_2020 <- inner_join(x = housing_dnb, y = clust_housing_dnb)

To analyse the data by cluster, we need to summarise it by cluster. You can see the results in the table below, with an added variable establishment, which is the number of establishments present in each cluster.

clust_h_dnb_2020 <- housing_dnb_2020 %>% 
  group_by(cluster) %>% 
  summarise(establishment = n(),
            external_students_rate = mean(external_students_rate, na.rm = TRUE),
            half_boarders_students_rate = mean(half_boarders_students_rate, na.rm = TRUE), 
            boarding_students_rate = mean(boarding_students_rate, na.rm = TRUE),
            without_pct = mean(without_pct, na.rm = TRUE),
            AB_pct = mean(AB_pct, na.rm = TRUE),
            B_pct = mean(B_pct, na.rm = TRUE),
            TB_pct = mean(TB_pct, na.rm = TRUE),
            success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>% 
  round(digits = 2)

datatable(clust_h_dnb_2020, options =list(scrollX = "300px"))

We visually represented the results in a spider chart using the radarchart function.

op <- par(mar=c(0, 0, 0, 0))

radar <- clust_h_dnb_2020 %>%
  select(-c(cluster, establishment))

# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100, ncol(radar)), rep(0, ncol(radar)), radar)
# plot the radar chart with the right color from viridis 
 radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= viridis(4, alpha = 1) , pfcol= viridis(4, alpha = 0.2)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4)#, title = "Radar Graph") 
 # Add a legend
 legend(x=1.2, y=-0.4, legend = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"), bty = "n", pch=20 , col=viridis(4) , text.col = "grey", cex=0.4, pt.cex=1)

par(op)

For better clarity, we have also displayed each cluster individually.

# we need to set the margin and create two rows to display the graphs
op <- par(mar=c(0, 1, 1, 0),mfrow=c(2, 2))

##### Cluster 1

#filter cluster 1 
radar <- clust_h_dnb_2020 %>% 
 filter(cluster == 1)%>%
  select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100, ncol(radar)), rep(0, ncol(radar)), radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.267, green = 0.00392, blue = 0.329,  alpha = 0.5) , pfcol= rgb( red = 0.267, green = 0.00392, blue = 0.329,  alpha = 0.2)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 1", cex.main = 1 )

##### Cluster 2
radar <- clust_h_dnb_2020 %>% 
 filter(cluster == 2) %>% 
 select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100, ncol(radar)), rep(0, ncol(radar)), radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.192  , green = 0.40784, blue = 0.557,  alpha = 0.5) , pfcol= rgb( red = 0.192    , green = 0.40784, blue = 0.557,  alpha = 0.5)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 2", cex.main = 1 )


##### Cluster 3 
radar <- clust_h_dnb_2020 %>% 
 filter(cluster == 3) %>% 
  select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100, ncol(radar)), rep(0, ncol(radar)), radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.208  , green = 0.71765, blue = 0.475,  alpha = 0.5) , pfcol= rgb( red = 0.208    , green = 0.71765, blue = 0.475,  alpha = 0.5)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 3", cex.main = 1 )

##### Cluster 4 
radar <- clust_h_dnb_2020 %>% 
 filter(cluster == 4) %>% 
  select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100, ncol(radar)), rep(0, ncol(radar)), radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.992  , green = 0.90588, blue = 0.145,  alpha = 0.5), pfcol= rgb( red = 0.992 , green = 0.90588, blue = 0.145,  alpha = 0.5), plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 4", cex.main = 1  )

par(op)

The cluster analysis shows very interesting results. Our intuition from the regression analysis is confirmed, as the success rate varies between the four clusters. The highest-achieving establishments are those offering a canteen for lunch, where the children stay at school and play with their friends. The lowest success rate belongs to the group going home for lunch, where the children are separated and their realities differ greatly. Indeed, some students have a comforting environment at home, whereas others have a hard time at home or at someone else’s home. The difference is also quite marked for some of the distinctions. The achievement rates for the distinctions AB and B are rather similar across clusters. Students from an establishment in the first cluster achieve the best results, with 33.3% of them receiving the distinction TB and only 17.9% not receiving any distinction. The rates are much worse if the school has a boarding offer: 26.4% of students do not receive any distinction and only 14.7% get the TB distinction. This could be because some of these are establishments for elite athletes, who focus more and more on their sport.

From this analysis, we can conclude that having lunch and sleeping at school or at home does have a small influence on the results of the Diplôme National du Brevet. Better results are achieved by students in establishments offering lunch. Going home for lunch does seem to hinder academic success, and boarding schools might not be the best choice for students aiming for top honours.

5.2.3 Single-parent families

The first analysis is at the national level. We want to use two plot types: a bar plot representing the total number of single-parent families and a line plot representing the success rate. Both plots use session for the x-axis. Since the range of the success rate is much lower than that of the single-parent families, the use of a secondary y-axis with an adapted scale is recommended.

sp <- single_parent%>%
  select("session", "department_fr", "sing_par")
singpar_vs_dnb <- left_join(dnb_results_dep, sp, by = c("department_fr","session"))

sp_vs_dnb <- singpar_vs_dnb%>%
  select(success_rate_pct, session, sing_par, department_fr)%>%
  group_by(session)%>%
  summarise(success_rate_pct = mean(success_rate_pct), 
            sing_par = sum(sing_par, na.rm = TRUE)) %>%
  ggplot()+
  geom_col(aes(x = session, y = sing_par)) +
  geom_line(aes(x = session, y = 30000*success_rate_pct), size = 1, color="blue", group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./30000, name = "Success rate")) +
  labs( x = "Session", y = "Single parent families" )+
  theme_ipsum()
sp_vs_dnb

Despite the constant increase in the number of single-parent families over time, the DNB results do not appear to be influenced much by it. Intuitively, as the number of single-parent families increases, the success rate should decrease. However, from 2007 to 2017, sing_par and success_rate_pct both increased, and from 2017 to 2019, although sing_par continued to increase, the success rate dropped from 88.74% to 86.14%.

We now want to analyse single-parent families against the DNB results at the departmental level. We will therefore fit a linear regression, using singpar_vs_dnb, a data frame showing the DNB results and sing_par per department_fr.

ggplot(data = singpar_vs_dnb,
       mapping = aes(x = sing_par, y = success_rate_pct)) +
  labs(title = "Single-parent families VS success rate", x = "sing_par", y = "Success rate", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)+
  theme_ipsum()
#> Warning: Removed 960 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 960 rows containing missing values (`geom_point()`).

regression_sp <- lm(data = singpar_vs_dnb, success_rate_pct ~ sing_par)
tab_model(regression_sp)
                  success_rate_pct
Predictors        Estimates   CI              p
(Intercept)       85.68       85.24 – 86.12   <0.001
sing_par          -0.00       -0.00 – -0.00   0.020
Observations      576
R2 / R2 adjusted  0.009 / 0.008

The 34 missing values are the overseas departments, for which we did not have the success rate information. The R-squared is equal to 0.009 and the p-value is equal to 0.0203: the coefficient is statistically significant at the 5% level, but it is essentially zero and the model explains almost none of the variance, so it does not fit our data. Therefore, the success rate of a department does not depend on the number of single-parent families in the department.

5.3 Has the COVID-19 pandemic impacted student performance?

In order to observe the influence of Covid on student performance, we are going to investigate the relationship between the incidence rate and the DNB success rate in every department in France. To do so, we first join covid_in_schools and dnb_results in order to create a new data frame called covid_vs_dnb, and then fit a simple linear regression model.

sessions <- c("2020", "2021")  # avoid masking base::c

d <- dnb_results %>%
  select("session", "department_fr", "success_rate_pct") %>%
  filter(session %in% sessions) %>%
  group_by(department_fr, session) %>%
  summarise(success_rate_pct = mean(success_rate_pct))

co <- covid_in_schools %>%
  select("session", "department_fr", "incidence_rate") %>%
  group_by(department_fr, session) %>%
  summarise(incidence_rate = mean(incidence_rate))

covid_vs_dnb <- inner_join(d, co, by = c("session", "department_fr"))
ggplot(data = covid_vs_dnb,
       mapping = aes(x = incidence_rate, y = success_rate_pct)) +
  labs(title = "incidence-rate of Covid cases VS success rate", x = "Incidence-rate", y = "Success rate", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)

regression <- lm(data = covid_vs_dnb, success_rate_pct ~ incidence_rate)
tab_model(regression)
                  success_rate_pct
Predictors        Estimates   CI              p
(Intercept)       90.12       89.71 – 90.53   <0.001
incidence_rate    -0.01       -0.02 – -0.01   <0.001
Observations      192
R2 / R2 adjusted  0.220 / 0.216

According to the residuals, the regression is more precise for high values than for low values. The p-value is close to 0 and therefore significant, so the model seems usable. However, the R-squared is around 0.22, meaning the incidence rate explains only about 22% of the variation in the success rate. Therefore, the result here is not very conclusive.
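The residual pattern described above can be checked with a standard diagnostic plot; `regression` is the model fitted in the chunk above:

```r
# Residuals vs fitted values: the spread of the residuals should shrink
# for high fitted values if the fit is indeed more precise there
plot(regression, which = 1)
```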

We then fitted a linear regression for 2020 and 2021 separately, because we thought that the low 2020 values could have distorted our model.

2020
covid_vs_dnb2020 <- inner_join(d, co, by = c("session", "department_fr")) %>%
  filter(session == 2020)
ggplot(data = covid_vs_dnb2020,
       mapping = aes(x = incidence_rate, y = success_rate_pct)) +
  labs(title = "incidence-rate of Covid cases VS success rate", x = "Incidence-rate", y = "Success rate", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)
regression2020 <- lm(data = covid_vs_dnb2020, success_rate_pct ~ incidence_rate)
tab_model(regression2020)
                  success_rate_pct
Predictors        Estimates   CI              p
(Intercept)       90.12       89.60 – 90.64   <0.001
incidence_rate    0.01        -0.17 – 0.20    0.880
Observations      96
R2 / R2 adjusted  0.000 / -0.010

2021
covid_vs_dnb2021 <- inner_join(d, co, by = c("session", "department_fr")) %>%
  filter(session == 2021)
ggplot(data = covid_vs_dnb2021,
       mapping = aes(x = incidence_rate, y = success_rate_pct)) +
  labs(title = "incidence-rate of Covid cases VS success rate", x = "Incidence-rate", y = "Success rate", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)
regression2021 <- lm(data = covid_vs_dnb2021, success_rate_pct ~ incidence_rate)
tab_model(regression2021)
                  success_rate_pct
Predictors        Estimates   CI              p
(Intercept)       89.51       88.00 – 91.02   <0.001
incidence_rate    -0.01       -0.02 – -0.00   0.046
Observations      96
R2 / R2 adjusted  0.042 / 0.031

From these regression results, we cannot use these models to explain the relationship between the incidence rate and the success rate. Consequently, Covid seems not to have had a big impact on the DNB results. This finding is not so surprising given the implementation of continuous assessment. In fact, according to an article by l’Etudiant (Cojean, 2020), the French government took special measures for the DNB exams following Covid: since 2020, the diploma has been awarded on the basis of continuous assessment, with no final tests.

Since our linear regressions were inconclusive, the next step is to observe whether or not Covid influenced the attribution of the distinctions. It can be hypothesized that, given the continuous assessment, the grading was more lenient. This would lead to better results, and thus more students admitted, with or without a distinction. In view of the results for 2020, which did not vary enormously, we will mainly focus on the attribution of distinctions in 2021. To do so, we create covid_vs_dnbmention.

m <- dnb_results %>%
  select("session", "department_fr", "TB_pct", "B_pct", "AB_pct", "admitted_without") %>%
  filter(session == 2021) %>%
  group_by(department_fr, session) %>%
  summarise(TB_pct = mean(TB_pct), B_pct = mean(B_pct),
            AB_pct = mean(AB_pct), admitted_without = mean(admitted_without))

co <- covid_in_schools %>%
  select("session", "department_fr", "incidence_rate") %>%
  group_by(department_fr, session) %>%
  summarise(incidence_rate = mean(incidence_rate))

covid_vs_dnbmention <- inner_join(m, co, by = c("session", "department_fr"))
Linear regression for each distinction
TB
ggplot(data = covid_vs_dnbmention,
         mapping = aes(x = incidence_rate, y = TB_pct)) +
  labs(title = 'Incidence-rate of Covid cases VS Distinction "Très Bien"', x = "Incidence-rate", y = "Percentage of TB", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)
#> Warning: Removed 2 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
regression <- lm(data = covid_vs_dnbmention, TB_pct ~ incidence_rate)
tab_model(regression)
                  TB_pct
Predictors        Estimates   CI              p
(Intercept)       20.21       17.70 – 22.73   <0.001
incidence_rate    0.02        0.00 – 0.04     0.016
Observations      94
R2 / R2 adjusted  0.062 / 0.052

B
ggplot(data = covid_vs_dnbmention,
         mapping = aes(x = incidence_rate, y = B_pct)) +
  labs(title = 'Incidence-rate of Covid cases VS Distinction "Bien"', x = "Incidence-rate", y = "Percentage of B", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)
#> Warning: Removed 2 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
regression <- lm(data = covid_vs_dnbmention, B_pct ~ incidence_rate)
tab_model(regression)
                  B_pct
Predictors        Estimates   CI              p
(Intercept)       25.68       24.36 – 27.00   <0.001
incidence_rate    -0.00       -0.01 – 0.00    0.281
Observations      94
R2 / R2 adjusted  0.013 / 0.002

AB
ggplot(data = covid_vs_dnbmention,
         mapping = aes(x = incidence_rate, y = AB_pct)) +
  labs(title = 'incidence-rate of Covid cases VS Distinction "Assez Bien"', x = "Incidence-rate", y = "Percentage of AB", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)
#> Warning: Removed 2 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
regression <- lm(data = covid_vs_dnbmention, AB_pct ~ incidence_rate)
tab_model(regression)
                  AB_pct
Predictors        Estimates   CI              p
(Intercept)       28.59       27.14 – 30.05   <0.001
incidence_rate    -0.02       -0.03 – -0.01   0.002
Observations      94
R2 / R2 adjusted  0.099 / 0.089

Without Distinction
ggplot(data = covid_vs_dnbmention,
       mapping = aes(x = incidence_rate, y = admitted_without)) +
  labs(title = "incidence-rate of Covid cases VS Admitted Without Distinction", x = "Incidence-rate", y = "Percentage of students admitted without distinction", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)
regression <- lm(data = covid_vs_dnbmention, admitted_without ~ incidence_rate )
tab_model(regression)
                  admitted_without
Predictors        Estimates   CI              p
(Intercept)       9.87        6.89 – 12.85    <0.001
incidence_rate    0.05        0.03 – 0.07     <0.001
Observations      96
R2 / R2 adjusted  0.204 / 0.196

In view of the various linear regressions, as Covid cases increased, the highest distinction and admission without distinction seem to have been granted more easily. In fact, the regressions for the distinction TB and for admitted_without indicate weak positive correlations. This leads us to think that the grading was more lenient in the departments where Covid was very present. Nevertheless, it is difficult to interpret these results: even where the coefficients are statistically significant, all the R-squared values are extremely poor.

Since the linear regressions do not allow us to say much about the influence of Covid on student success, we decided to perform a cluster analysis and see whether we could get any interesting results out of it.

cluster1 <- covid_vs_dnb2021
cluster <- as.data.frame(cluster1[-c(1, 2)])   # drop department_fr and session
row.names(cluster) <- cluster1$department_fr   # keep department names as labels
cluster <- scale(cluster)


fviz_nbclust(cluster, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2) +
  labs(subtitle = "Elbow method")
km.res1 <- kmeans(cluster, 5, nstart = 25) # between_SS / total_SS = 77.4%
fviz_cluster(km.res1, cluster,
             ggtheme = theme_ipsum(),
             repel = TRUE)+
  scale_fill_viridis(discrete = TRUE)+
  scale_color_viridis(discrete = TRUE)

We see on the plot that the success rates do not differ much between the clusters despite the differences in incidence rate. So, as seen in the linear regressions, Covid cases did not noticeably affect the DNB results.
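The data frame clust_cov_dnb used below is not constructed in the chunks shown here; a plausible sketch, assuming the km.res1 cluster assignments (named by department) and the distinction columns of dnb_results, could be:

```r
library(dplyr)
library(tidyr)

# Attach each department's cluster, then average the distinction rates by
# cluster and session and pivot to long format for plotting
clust_cov_dnb <- dnb_results %>%
  inner_join(tibble(department_fr = names(km.res1$cluster),
                    cluster = km.res1$cluster),
             by = "department_fr") %>%
  group_by(cluster, session) %>%
  summarise(TB = mean(TB_pct, na.rm = TRUE),
            B = mean(B_pct, na.rm = TRUE),
            AB = mean(AB_pct, na.rm = TRUE),
            without = mean(without_pct, na.rm = TRUE),
            .groups = "drop") %>%
  pivot_longer(c(TB, B, AB, without),
               names_to = "Mention_type", values_to = "Rate")
```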

p <- clust_cov_dnb %>% 
  filter(cluster == 1) %>% 
  ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
  geom_line()+
  scale_color_viridis(discrete = TRUE) +
    labs(title = "Cluster 1", x = "Session", y = "Rate in %" ) +
    theme_ipsum()

ggplotly(p, tooltip = c("x" ,"y"))
p <- clust_cov_dnb %>% 
  filter(cluster == 3) %>% 
  ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
  geom_line()+
  scale_color_viridis(discrete = TRUE) +
    labs(title = "Cluster 3", x = "Session", y = "Rate in %" ) +
    theme_ipsum()

ggplotly(p, tooltip = c("x" ,"y"))

5.4 Trials

Regression template

#> 
#> Call:
#> lm(formula = success_rate_pct ~ external_students_rate + half_boarders_students_rate + 
#>     boarding_students_rate, data = housing_dnb)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -5.072 -1.570  0.061  1.523  6.073 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                  95.6941     4.6987   20.37   <2e-16 ***
#> external_students_rate       -0.1034     0.0437   -2.37    0.019 *  
#> half_boarders_students_rate  -0.0576     0.0532   -1.08    0.280    
#> boarding_students_rate            NA         NA      NA       NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.26 on 189 degrees of freedom
#> Multiple R-squared:  0.0726, Adjusted R-squared:  0.0628 
#> F-statistic:  7.4 on 2 and 189 DF,  p-value: 0.000805
#> 
#> Call:
#> lm(formula = success_rate_pct ~ external_students_rate, data = housing_dnb)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -5.039 -1.541  0.048  1.501  7.130 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)             90.6304     0.4484  202.11  < 2e-16 ***
#> external_students_rate  -0.0594     0.0161   -3.69  0.00029 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.26 on 190 degrees of freedom
#> Multiple R-squared:  0.0669, Adjusted R-squared:  0.062 
#> F-statistic: 13.6 on 1 and 190 DF,  p-value: 0.000293

Cluster template

#> [1] 0.78
#> Warning: ggrepel: 70 unlabeled data points (too many overlaps).
#> Consider increasing max.overlaps

Regression trial

lm1 <- lm(dnb_results$TB_pct ~ dnb_results$without_pct + dnb_results$B_pct + dnb_results$AB_pct)
summary(lm1)
#> 
#> Call:
#> lm(formula = dnb_results$TB_pct ~ dnb_results$without_pct + dnb_results$B_pct + 
#>     dnb_results$AB_pct)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -1.71e-09  0.00e+00  0.00e+00  0.00e+00  2.55e-11 
#> 
#> Coefficients:
#>                          Estimate Std. Error   t value Pr(>|t|)    
#> (Intercept)              1.00e+02   1.17e-13  8.53e+14   <2e-16 ***
#> dnb_results$without_pct -1.00e+00   1.26e-15 -7.95e+14   <2e-16 ***
#> dnb_results$B_pct       -1.00e+00   2.25e-15 -4.44e+14   <2e-16 ***
#> dnb_results$AB_pct      -1.00e+00   1.50e-15 -6.66e+14   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.74e-12 on 135154 degrees of freedom
#>   (29 observations deleted due to missingness)
#> Multiple R-squared:     1,   Adjusted R-squared:     1 
#> F-statistic: 3.04e+29 on 3 and 135154 DF,  p-value: <2e-16

Cluster trial

dnb_pct_dep <- dnb_results %>%
  group_by(department, session) %>% 
  summarise(AB_pct_dep = mean(AB_pct, na.rm = TRUE),
            B_pct_dep = mean(B_pct, na.rm = TRUE),
            TB_pct_dep = mean(TB_pct, na.rm = TRUE),
            without_pct_dep = mean(without_pct, na.rm = TRUE),
            success_rate_pct_dep = mean(success_rate_pct, na.rm = TRUE))

pairs(dnb_pct_dep[2:6])

distance <- dist(dnb_pct_dep[, -1])  # use numeric columns only, avoiding NA coercion

mydata.hclust <- hclust(distance)
plot(mydata.hclust)

  • Answers to the research questions
  • Different methods considered
  • Competing approaches
  • Justifications

6 Conclusion

  • Take home message
  • Limitations
  • Future work?